Abstract

In recent years, research efforts to extend linear metric learning models to handle nonlinear
structures have attracted great interest. In this paper, we propose a novel nonlinear solution
that uses deformable geometric models to learn spatially varying metrics, and apply the strategy
to boost the performance of both kNN and SVM classifiers. Thin-plate splines (TPS) are chosen as
the geometric model due to their remarkable versatility and representation power in accounting
for high-order deformations. By transforming the input space through TPS, we can pull same-class
neighbors closer while pushing different-class points farther away in kNN, as well as make the
input data points more linearly separable in SVMs. Improvements in the performance of kNN
classification are demonstrated through experiments on synthetic and real-world datasets, with
comparisons made with several state-of-the-art metric learning solutions. Our SVM-based models
also achieve significant improvements over traditional linear and kernel SVMs on the same
datasets.


1 Introduction 

Many machine learning and data mining algorithms rely on Euclidean metrics to compute pairwise
dissimilarities, which assign equal weight to each feature component. Replacing the Euclidean
metric with one learned from the inputs can often significantly improve the performance of these
algorithms. Based on the form of the learned metric, metric learning (ML) algorithms can be
categorized into linear and nonlinear groups. Linear models commonly try to estimate a “best”
affine transformation to deform the input space, such that the resulting Mahalanobis distance
agrees well with the supervisory information carried by the training samples. Many early works
have focused on linear methods as they are easy to use, convenient to optimize and less prone to
overfitting. However, when handling data with nonlinear structures, linear models show inherently
limited expressive power and separation capability: highly nonlinear multi-class boundaries often
cannot be modeled well by a single Mahalanobis distance metric.

Generalizing linear models to nonlinear cases has gained steam in recent years, and such
extensions have been pushed forward mainly along the kernelization and localization directions.
The idea of kernelization is to embed the input features into a higher-dimensional space, with
the hope that the transformed data will be more linearly separable in the new space. While
kernelization may dramatically improve the performance of linear methods on many highly nonlinear
problems, solutions in this group are prone to overfitting, and their use is inherently limited
by the sizes of the kernel matrices [17]. Localization approaches focus on combining multiple
local metrics, which are learned based on either local neighborhoods or class memberships. The
granularity of the neighborhoods varies from per-partition and per-class to per-exemplar. A
different strategy is adopted in the GB-LMNN method [18], which learns a global nonlinear mapping
by iteratively adding nonlinear components onto a linear metric. At each iteration, a regression
tree of depth $p$ splits the input space into $2^p$ axis-aligned regions, and points falling into
the regions are shifted in different directions. While the localization strategies are usually
more powerful in accommodating nonlinear structures, generalizing these methods to fit
classifiers other than kNN is not trivial. To avoid non-symmetric metrics, extra care is commonly
needed to ensure the smoothness of the transformed input space. In addition, estimating geodesic
distances and group statistics on such metric manifolds is often computationally expensive.

Most existing ML solutions are designed around pairwise distances, and are therefore best suited
to improving nearest neighbor (NN) based algorithms, such as k-NN and k-means. Typically, a
two-step procedure is involved: a best metric is first estimated from training samples, and the
learned metric is then applied in the ensuing classification or clustering algorithm. Since
learning a metric is equivalent to learning a feature transformation, metric learning can also be
applied to SVM models, either as a preprocessing step or as an input space transformation
[19, 20, 21]. In [19], Xu et al. studied both approaches and found that applying linear
transformations to the input samples outperformed three state-of-the-art linear ML models used as
preprocessing steps for SVMs. Several other transformation-based models [20, 21] have also
reported improved classification accuracies over the standard linear and kernel SVMs. However,
all of these models employ linear transformations, which limits their capabilities in dealing
with complex data.

In light of the aforementioned limitations and drawbacks of the existing models, we propose a new
nonlinear remedy in this paper. Our solution is a direct generalization of linear metric learning
through the application of deformable geometric models to transform the entire input space. In
this study, we choose thin-plate splines (TPS) as the transformation model, a choice that
balances computational efficiency against richness of description. TPS are well known for their
remarkable versatility and representation power in accounting for high-order deformations. We
have designed TPS-based ML solutions for both kNN and SVM classifiers, which are presented in the
next two sections. To the best of our knowledge, this is the first work that utilizes nonlinear
dense transformations, or spatially varying deformation models, in metric learning. Our
experimental results on synthetic and real data demonstrate the effectiveness of the proposed
methods.


2 Nonlinear Metric Learning for Nearest Neighbor 

Many linear metric learning models are formulated under the nearest neighbor (NN) paradigm, with
the shared goal that the estimated transformation pulls similar data points closer while pushing
dissimilar points apart. Our nonlinear ML model for NN is designed with the same idea. However,
instead of using a single linear transformation, we choose to deform the input space nonlinearly
through powerful radial basis functions, namely thin-plate splines (TPS). With TPS, nonlinear
metrics are computed globally, with smoothness ensured across the entire data space. As in linear
models, the learned pairwise distance is simply the Euclidean distance after the nonlinear
projection of the data through the estimated TPS transformation.

In this section, the pioneering Mahalanobis metric learning for clustering method (MMC) proposed
by Xing et al. [3] is used as the platform to formulate our nonlinear ML solution for NN. We
therefore briefly review the concept of MMC first. We then describe the theoretical background of
TPS in the general context of transformations, followed by the presentation of our proposed model.

2.1 Linear Metric Learning and MMC 

Given a set of training data instances $X = \{x_i \mid x_i \in \mathbb{R}^d, i = 1, \dots, n\}$,
where $n$ is the number of training samples and $d$ is the number of features of each data
instance, the goal of ML is to learn a “better” metric function
$D : X \times X \to \mathbb{R}$ for the problem of interest, using the information carried by the
training samples. The Mahalanobis metric is one of the most popular metric functions used in
existing ML algorithms [4, 5], and is defined by
$D_M(x_i, x_j) = \sqrt{(x_i - x_j)^T M (x_i - x_j)}$.

The control parameter $M \in \mathbb{R}^{d \times d}$ is a square matrix. In order to qualify as
a valid (pseudo-)metric, $M$ has to be positive semi-definite (PSD), denoted as $M \succeq 0$. As
a PSD matrix, $M$ can be decomposed as $M = L^T L$, where $L \in \mathbb{R}^{k \times d}$ and $k$
is the rank of $M$. Then, $D_M(x_i, x_j)$ can be rewritten as follows:

$$D_M(x_i, x_j) = \sqrt{(x_i - x_j)^T L^T L (x_i - x_j)} = \sqrt{(L x_i - L x_j)^T (L x_i - L x_j)}. \tag{1}$$

Eqn. (1) explains why learning a Mahalanobis metric is equivalent to learning a linear
transformation function and computing the Euclidean distance over the transformed data domain.
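The equivalence in Eqn. (1) can be checked numerically. The following is a minimal sketch in plain Python; the matrix `L`, the two points and all helper names are illustrative assumptions, not values from our experiments.

```python
import math

def matvec(A, v):
    """Multiply matrix A (a list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def gram(L):
    """Return M = L^T L for a k x d matrix L."""
    cols = list(zip(*L))  # columns of L
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def mahalanobis(x, y, M):
    """D_M(x, y) = sqrt((x - y)^T M (x - y))."""
    diff = [a - b for a, b in zip(x, y)]
    return math.sqrt(sum(d * m for d, m in zip(diff, matvec(M, diff))))

def euclidean_after_map(x, y, L):
    """Euclidean distance between L x and L y."""
    return math.dist(matvec(L, x), matvec(L, y))

# Hypothetical rank-2 transformation of 3-dimensional inputs (k = 2, d = 3).
L = [[2.0, 0.0, 1.0],
     [0.0, 1.0, -1.0]]
M = gram(L)
x, y = [1.0, 2.0, 3.0], [0.0, -1.0, 2.0]

d_M = mahalanobis(x, y, M)
d_L = euclidean_after_map(x, y, L)
print(d_M, d_L)  # the two values agree up to rounding
```

Because $M$ is built as $L^T L$, it is PSD by construction, so the square root is always well defined.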

With the side information embedded in the class-equivalent constraints
$\mathcal{P} = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ belong to the same class}\}$ and the
class-nonequivalent constraints
$\mathcal{N} = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ belong to different classes}\}$, the
MMC model formulates the problem of ML as the following convex programming problem:

$$\min_{M} \; J(M) = \sum_{(x_i, x_j) \in \mathcal{P}} D_M^2(x_i, x_j) \quad \text{s.t.} \quad M \succeq 0, \;\; \sum_{(x_i, x_j) \in \mathcal{N}} D_M^2(x_i, x_j) \ge 1. \tag{2}$$

The objective function aims at improving the subsequent NN based algorithms by minimizing the sum
of distances between similar training data, while keeping the sum of distances between dissimilar
ones large. Note that, besides the PSD constraint on $M$, an additional constraint on the training
samples in $\mathcal{N}$ is needed to avoid trivial solutions of the optimization. To solve this
optimization problem, the projected gradient descent method is used, which projects the estimated
matrix back onto the set of PSD matrices whenever necessary.


2.2 TPS 


Thin-plate splines (TPS) are the high-dimensional analogs of the cubic splines in one dimension,
and have been widely used as an interpolation tool in data approximation, surface reconstruction,
shape alignment, etc. When used to align a set of $n$ corresponding point pairs $u_i$ and $v_i$
$(i = 1, \dots, n)$, a TPS transformation is a mapping function
$f(x) : \mathbb{R}^d \to \mathbb{R}^d$ within a suitable Hilbert space $\mathcal{H}$ that matches
$u_i$ and $v_i$, and also minimizes a smoothness TPS penalty functional
$J_m^d(f) : \mathcal{H} \to \mathbb{R}$ (given in Eqn. (5)).

Typically, the problem of finding $f$ can be decomposed into $d$ interpolation problems, finding
the component thin-plate splines $f_k$, $k = 1, \dots, d$, separately. Suppose the unknown
interpolation function $f_k : \mathbb{R}^d \to \mathbb{R}$ belongs to the Sobolev space
$H^m(\Omega)$, where $m$ is an unknown positive integer and $\Omega$ is an open subset of
$\mathbb{R}^d$. TPS transformations minimize a smoothness penalty functional of the following
general form:

$$J_m^d(f) = \int \|D^m f\|^2 \, dX = \sum_{\alpha_1 + \dots + \alpha_d = m} \frac{m!}{\alpha_1! \cdots \alpha_d!} \int \cdots \int \left( \frac{\partial^m f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}} \right)^2 dx_1 \dots dx_d, \tag{3}$$


where $D^m f$ is the matrix of $m$-th order partial derivatives of $f$, with the $\alpha_k$ being
non-negative integers, and $dX = dx_1 \dots dx_d$, where the $x_j$ are the components of $x$. The
penalty functional is the generalized form of the space integral of the squared second-order
derivatives of the mapping function. We suppose that the mapping function $f \in \mathcal{C}$, a
space of functions whose partial derivatives of total order $m$ are in $L_2(\mathbb{R}^d)$. To
have the evaluation functionals bounded in $\mathcal{C}$, we need $\mathcal{C}$ to be a
reproducing kernel Hilbert space (r.k.h.s.), endowed with the seminorm $J_m^d(f)$. For this, it is
necessary and sufficient that $2m - d > 0$. The null space of $J_m^d(f)$ consists of a set of
polynomial functions with maximum degree $(m - 1)$, so the dimension of this null space is
$N_0 = (d + m - 1)! / (d! \, (m - 1)!)$.
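To give a feel for how quickly $N_0$ grows once $m$ is forced to satisfy $2m - d > 0$, the short sketch below evaluates the formula for a few dimensions. The function names are illustrative assumptions; the formula itself is the one stated above, rewritten as a binomial coefficient.

```python
from math import comb

def tps_null_space_dim(d, m):
    """N_0 = (d + m - 1)! / (d! (m - 1)!), i.e. the binomial coefficient C(d + m - 1, d)."""
    return comb(d + m - 1, d)

def smallest_valid_order(d):
    """Smallest positive integer m that satisfies 2m - d > 0."""
    return d // 2 + 1

for d in (2, 3, 10, 20):
    m = smallest_valid_order(d)
    print(f"d={d:2d}  m={m:2d}  N_0={tps_null_space_dim(d, m)}")
```

For $d = 2$ and $m = 2$ the null space has dimension 3 (spanned by $1$, $x_1$, $x_2$), while for $d = 20$ the smallest admissible order already gives a null space with tens of millions of dimensions.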

The main problem of TPS is that $N_0$, the dimension of the null space, increases exponentially
with $d$ due to the requirement of $2m - d > 0$. To solve this problem, Duchon [23] proposed to
replace $J_m^d(f)$ by its weighted squared norm in Fourier space. Since the Fourier transform,
denoted as $\mathcal{F}(\cdot)$, is isometric, the penalty functional $J_m^d(f)$ can be replaced
by its squared norm in Fourier space:

$$\int \|D^m f(t)\|^2 \, dX \iff \int \|\mathcal{F}(D^m f(\tau))\|^2 \, d\tau. \tag{4}$$

By adding a weighting function, Duchon introduced a new penalty functional to solve the
exponential growth problem of the dimension of the TPS null space, which is defined as

$$J_{m,s}^d(f) = \int |\tau|^{2s} \, \|\mathcal{F}(D^m f(\tau))\|^2 \, d\tau, \tag{5}$$

provided that $2(m + s) - d > 0$ and $2s < d$. As suggested by [23], one can select an appropriate
$s$ to obtain a lower dimension for the null space of $J_{m,s}^d(f)$, with the maximum degree of
the polynomial functions spanning this null space decreased to 1.

The classic solution of Eqn. (5) has a representation in terms of a radial basis function (the
TPS interpolation function),

$$f_k(x) = \sum_{i=1}^{n} \psi_i \, G(\|x - x_i\|) + \ell^T x + c, \tag{6}$$

where $\|\cdot\|$ denotes the Euclidean norm and $\{\psi_i\}$ are a set of weights for the
nonlinear part; $\ell$ and $c$ are the weights for the linear part. The corresponding radial
distance kernel of TPS, which is the Green's function to solve Eqn. (5), is as follows:

$$G(x, x_i) = G(\|x - x_i\|) \propto \begin{cases} \|x - x_i\|^{2(m+s)-d} \ln \|x - x_i\|, & \text{if } 2(m+s) - d \text{ is even;} \\ \|x - x_i\|^{2(m+s)-d}, & \text{otherwise.} \end{cases} \tag{7}$$

For more details about TPS, we refer readers to [23, 24].
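As an illustration of Eqn. (7), the sketch below evaluates the radial kernel up to its proportionality constant. The function name, the default values $m = 2$, $s = 0$ and the convention $G(0) = 0$ (the limit of both branches as $r \to 0$) are assumptions made for this example.

```python
import math

def tps_kernel(r, d, m=2, s=0.0):
    """Radial TPS kernel of Eqn. (7), up to a constant factor.

    r: Euclidean distance ||x - x_i||; d: input dimension;
    m, s: order and weight exponent, with 2(m + s) - d > 0 and 2s < d.
    """
    power = 2 * (m + s) - d
    if power <= 0 or 2 * s >= d:
        raise ValueError("need 2(m + s) - d > 0 and 2s < d")
    if r == 0.0:
        return 0.0  # limit of r^power * ln(r) and of r^power as r -> 0
    if float(power).is_integer() and int(power) % 2 == 0:
        return r ** power * math.log(r)  # even exponent: logarithmic branch
    return r ** power                    # otherwise: pure power branch

print(tps_kernel(2.0, d=2))          # classic 2-D kernel r^2 ln r
print(tps_kernel(2.0, d=3))          # 3-D case with m = 2, s = 0: r
print(tps_kernel(2.0, d=10, s=3.5))  # exponent reduced to 1 via s
```

With $d = 2$, $m = 2$ and $s = 0$ this reduces to the familiar $r^2 \ln r$ kernel.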


2.3 TPS Metric Learning for Nearest Neighbor (TML-NN) 

The TPS transformation for point interpolation, as specified in Eqn. (6), can be employed as the
geometric model to deform the input space for nonlinear metric learning. Such a transformation
ensures a certain desired smoothness, as it minimizes the bending energy $J_m^d(f)$ in Eqn. (3).
Within the metric learning setting, let $x$ be one of the training samples in the original
feature space $X$ of $d$ dimensions, and $f(x)$ be the transformed destination of $x$, also of
$d$ dimensions. Through straightforward mathematical manipulations [25], we can write $f(x)$ in
matrix form:

$$f(x) = Lx + \Psi \begin{pmatrix} G(x, x_1) \\ \vdots \\ G(x, x_p) \end{pmatrix} = Lx + \Psi G(x), \tag{8}$$

where $L$ (size $d \times d$) is a linear transformation matrix, corresponding to $L$ in the
Mahalanobis metric, $\Psi$ (size $d \times p$) is the weight matrix for the nonlinear parts, and
$p$ is the number of anchor points $(x_1, \dots, x_p)$ used to compute the TPS kernel. In
principle, all the training data points can be used as anchor points. In practice, however, to
limit the computational cost, $p$ anchor points are extracted to describe the whole input space,
for example with the k-medoids method.
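A minimal sketch of the mapping in Eqn. (8) follows. For brevity it assumes the pure-power kernel $G(r) = r$ (the odd-exponent branch of Eqn. (7) with exponent 1); the anchors, weights and function names are hypothetical.

```python
import math

def tps_transform(x, L, Psi, anchors, kernel=lambda r: r):
    """Evaluate f(x) = L x + Psi G(x) for one point x.

    L: d x d list of rows; Psi: d x p list of rows; anchors: p points in R^d.
    kernel: radial function G(r); the default G(r) = r is an assumption.
    """
    g = [kernel(math.dist(x, a)) for a in anchors]  # G(x), length p
    linear = [sum(l * xi for l, xi in zip(row, x)) for row in L]
    nonlinear = [sum(w * gi for w, gi in zip(row, g)) for row in Psi]
    return [a + b for a, b in zip(linear, nonlinear)]

identity = [[1.0, 0.0], [0.0, 1.0]]
anchors = [[0.0, 0.0], [0.0, 1.0]]

# With Psi = 0 and L = I the map is the identity.
print(tps_transform([1.0, 0.0], identity, [[0.0, 0.0], [0.0, 0.0]], anchors))

# A nonzero Psi whose rows sum to zero bends the space around the anchors.
Psi = [[1.0, -1.0], [0.0, 0.0]]
fx = tps_transform([1.0, 0.0], identity, Psi, anchors)
print(fx)
```

Setting $\Psi = 0$ recovers a purely linear metric, which is why the model is a direct generalization of Mahalanobis metric learning.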

The goal of our ML solution is also to pull samples of the same class closer to each other while
pushing different classes further apart, directly through the TPS nonlinear transformation
described in Eqn. (8). This can be achieved through the following constrained optimization:

$$\begin{aligned} \min_{L, \Psi} \quad & J = \sum_{(x_i, x_j) \in \mathcal{P}} \|f(x_i) - f(x_j)\|^2 + \lambda \|\Psi\|_F^2 \\ \text{s.t.} \quad & \sum_{(x_i, x_j) \in \mathcal{N}} \|f(x_i) - f(x_j)\|^2 \ge 1; \quad \sum_{i=1}^{p} \Psi_i^k = 0, \;\; \sum_{i=1}^{p} \Psi_i^k x_i^k = 0, \;\; \forall k = 1 \dots d. \end{aligned} \tag{9}$$

Here $f$ is in the form of Eqn. (8); $\Psi^k$ is the $k$th row of $\Psi$ (the weights of the
$k$th output component over the $p$ anchors); $x^k$ is the $k$th component of $x$. Compared with
MMC, another component $\|\Psi\|_F^2$, the squared Frobenius norm of $\Psi$, is added to the
objective function as a regularizer to prevent overfitting. $\lambda$ is the weighting factor
that controls the relative importance of the two components. As in MMC, the inequality constraint
$\sum_{(x_i, x_j) \in \mathcal{N}} \|f(x_i) - f(x_j)\|^2 \ge 1$ imposes a scaling control to avoid
trivial solutions. The two equality constraints with respect to $\Psi$ ensure that the elastic
part of the transformation is zero at infinity [26].

Due to the nonlinearity of TPS, it is difficult to solve this nonlinear constrained problem
analytically. Instead, we use a gradient-based constrained optimization solver (footnote 1) to
obtain a local minimum for Eqn. (9). The complexity of our TML-NN model is dominated by the
computation of the TPS kernel, which is $O(p \, n^2)$, as well as by the rate of convergence of
the chosen gradient-based optimizer. Here $n$ is the number of training samples, and $p$ is the
number of anchor points.

Footnote 1: We use the SQP-based constrained optimizer “fmincon” in the Matlab Optimization Toolbox.
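To make the pieces of Eqn. (9) concrete, the sketch below evaluates the objective, the left-hand side of the inequality constraint and the residuals of the two equality constraints for given $L$ and $\Psi$. It is only an evaluator, not the solver we use; the kernel $G(r) = r$, the toy data and all names are assumptions for illustration.

```python
import math

def tps_map(x, L, Psi, anchors):
    """f(x) = L x + Psi G(x), with the assumed kernel G(r) = r."""
    g = [math.dist(x, a) for a in anchors]
    return [sum(l * xi for l, xi in zip(Lrow, x)) +
            sum(w * gi for w, gi in zip(Prow, g))
            for Lrow, Prow in zip(L, Psi)]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def tml_nn_terms(X, similar, dissimilar, L, Psi, anchors, lam):
    """Return (objective J, inequality LHS, sum residuals, moment residuals)."""
    F = [tps_map(x, L, Psi, anchors) for x in X]
    pull = sum(sq_dist(F[i], F[j]) for i, j in similar)        # pairs in P
    push = sum(sq_dist(F[i], F[j]) for i, j in dissimilar)     # pairs in N
    reg = lam * sum(w * w for row in Psi for w in row)         # lambda ||Psi||_F^2
    sum_res = [sum(row) for row in Psi]                        # sum_i Psi_i^k
    mom_res = [sum(w * a[k] for w, a in zip(row, anchors))     # sum_i Psi_i^k x_i^k
               for k, row in enumerate(Psi)]
    return pull + reg, push, sum_res, mom_res

# Toy problem: points 0 and 1 share a class; points 2 and 3 belong to other classes.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
similar, dissimilar = [(0, 1)], [(0, 2), (1, 3)]
L = [[1.0, 0.0], [0.0, 1.0]]
Psi = [[0.0, 0.0], [0.0, 0.0]]
J, push, sum_res, mom_res = tml_nn_terms(X, similar, dissimilar, L, Psi, X[:2], lam=0.1)
print(J, push, sum_res, mom_res)
```

A candidate $(L, \Psi)$ is feasible when `push >= 1` and both residual lists are zero; a constrained optimizer then lowers `J` within that feasible set.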
To demonstrate the ability of TML-NN in handling nonlinear cases, we conducted an experiment
similar to the one used for the GB-LMNN method [18]. Fig. 1(a) shows a synthetic dataset
consisting of inputs sampled from two concentric circles (blue dots and red diamonds), each of
which defines a different class membership. Global linear transformations in linear metric
learning are not sufficient to improve the accuracy of kNN (k = 1) classification on this dataset.
In contrast, by utilizing TPS to model the underlying nonlinear transformation, as shown in
Fig. 1(b), we can easily enlarge the separation between the outer and inner circles, leading to
an improved classification rate (which would be 100% for 1NN).

Figure 1: (a) original inputs with coordinate grids; (b) transformed data and the deformation
field generated by TML-NN.


3 TPS Metric Learning for Support Vector Machines (TML-SVM) 


In this section, we present how to generalize our TPS metric learning model to SVMs. As in [20],
we formulate our model under the Margin-Radius-Ratio bounded SVM paradigm, which generalizes the
traditional SVMs by bounding the estimation error [27]. Given the training dataset
$X = \{x_i \mid x_i \in \mathbb{R}^d, i = 1, \dots, n\}$ together with the class label information
$y_i \in \{-1, +1\}$, our proposed TML-SVM aims to simultaneously learn the nonlinear
transformation described in Eqn. (8) and an SVM classifier, which can be formulated as follows:

$$\begin{aligned} \min_{L, \Psi, w, b} \quad & J = \frac{1}{2}\|w\|^2 + C_1 \sum_{i=1}^{n} \xi_i + C_2 \|\Psi\|_F^2 \\ \text{s.t.} \quad & y_i (w^T f(x_i) + b) = y_i (w^T (L x_i + \Psi G(x_i)) + b) \ge 1 - \xi_i, \;\; \forall i = 1 \dots n; && \text{(I)} \\ & \xi_i \ge 0, \;\; \forall i = 1 \dots n; && \text{(II)} \\ & \|f(x_i) - x_c\|^2 = \|L x_i + \Psi G(x_i) - x_c\|^2 \le 1, \;\; \forall i = 1 \dots n; && \text{(III)} \\ & \sum_{i=1}^{p} \Psi_i^k = 0, \;\; \sum_{i=1}^{p} \Psi_i^k x_i^k = 0, \;\; \forall k = 1 \dots d. && \text{(IV)} \end{aligned} \tag{10}$$


The objective function combines the regularizer w.r.t. $\Psi$ for the TPS transformation with the
traditional soft margin SVM. $C_1$ and $C_2$ are two trade-off hyper-parameters. The first two
inequality constraints (I and II) are the same as those used in traditional SVMs. The third
inequality constraint (III) is a unit-enclosing-ball constraint, which forces the radius of the
minimum enclosing ball to be at most one in the transformed space and avoids trivial solutions.
$x_c$ is the center of all samples. In practice, we can simplify the unit-enclosing-ball
constraint to $\|f(x_i)\|^2 \le 1$ through a preprocessing step that centralizes the input data:
$x_i \leftarrow x_i - \frac{1}{n} \sum_{j=1}^{n} x_j$. The last two equality constraints (IV)
maintain the properties of the TPS transformation at infinity, as in Eqn. (9).
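The centralization step and the simplified ball constraint can be sketched as follows. The helper names and the toy data are assumptions; the check is applied to already-transformed points $f(x_i)$.

```python
def centralize(X):
    """Subtract the sample mean from every point: x_i <- x_i - (1/n) sum_j x_j."""
    n = len(X)
    mean = [sum(col) / n for col in zip(*X)]
    return [[xi - mi for xi, mi in zip(x, mean)] for x in X]

def inside_unit_ball(F):
    """Check the simplified constraint ||f(x_i)||^2 <= 1 for all transformed points."""
    return all(sum(v * v for v in f) <= 1.0 for f in F)

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Xc = centralize(X)
print(Xc)                                            # the mean of Xc is the origin
print(inside_unit_ball([[0.5, 0.5], [-0.5, 0.0]]))   # feasible
print(inside_unit_ball([[1.0, 0.5]]))                # violates the constraint
```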

To solve this optimization problem, we propose an efficient EM-like iterative minimization
algorithm that updates $\{w, b\}$ and $\{L, \Psi\}$ alternately. With $\{L, \Psi\}$ fixed,
$f(x_i)$ is explicit, and Eqn. (10) can be reformulated as:

$$\min_{w, b} \; J = \frac{1}{2}\|w\|^2 + C_1 \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i (w^T f(x_i) + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; \forall i = 1 \dots n. \tag{11}$$


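This becomes exactly the primal form of soft margin SVMs, which can be solved by off-the-shelf
SVM solvers. With $\{w, b\}$ fixed, Eqn. (10) can be reformulated as:

$$\begin{aligned} \min_{L, \Psi} \quad & J = C_1 \sum_{i=1}^{n} \xi_i + C_2 \|\Psi\|_F^2 \\ \text{s.t.} \quad & y_i (w^T f(x_i) + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; \forall i = 1 \dots n; \\ & \|f(x_i)\|^2 \le 1, \;\; \forall i = 1 \dots n; \quad \sum_{i=1}^{p} \Psi_i^k = 0, \;\; \sum_{i=1}^{p} \Psi_i^k x_i^k = 0, \;\; \forall k = 1 \dots d. \end{aligned} \tag{12}$$

By using the hinge loss function, we can eliminate the variables $\xi_i$ and reformulate
Eqn. (12) as:

$$\begin{aligned} \min_{L, \Psi} \quad & J = C_1 \sum_{i=1}^{n} \max[0, 1 - y_i (w^T f(x_i) + b)]^2 + C_2 \|\Psi\|_F^2 \\ \text{s.t.} \quad & \|f(x_i)\|^2 \le 1, \;\; \forall i = 1 \dots n; \quad \sum_{i=1}^{p} \Psi_i^k = 0, \;\; \sum_{i=1}^{p} \Psi_i^k x_i^k = 0, \;\; \forall k = 1 \dots d. \end{aligned} \tag{13}$$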


As the squared hinge loss function is differentiable, it is not difficult to differentiate the
objective function w.r.t. $L$ and $\Psi$. As in solving Eqn. (9), we can also use a
gradient-based optimizer to obtain a local minimum for Eqn. (13), with the gradient computed as:

$$\begin{aligned} \frac{\partial J}{\partial \Psi} &= -2 C_1 \sum_{i=1}^{n} \max[0, 1 - y_i (w^T f(x_i) + b)] \, (y_i \, w \, G^T(x_i)) + 2 C_2 \Psi, \\ \frac{\partial J}{\partial L} &= -2 C_1 \sum_{i=1}^{n} \max[0, 1 - y_i (w^T f(x_i) + b)] \, (y_i \, w \, x_i^T). \end{aligned} \tag{14}$$
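The gradient expressions in Eqn. (14) can be sanity-checked against finite differences. The sketch below implements the unconstrained objective of Eqn. (13) and its analytic gradients in plain Python; the toy data, the precomputed kernel vectors `G` and all names are assumptions for illustration, and the constraints of Eqn. (13) are ignored here.

```python
def objective_and_grads(X, G, y, w, b, L, Psi, C1, C2):
    """Objective of Eqn. (13) without constraints, plus the gradients of Eqn. (14).

    X: n points (length d); G: n precomputed kernel vectors G(x_i) (length p).
    """
    d = len(w)
    J = C2 * sum(v * v for row in Psi for v in row)
    gL = [[0.0] * len(X[0]) for _ in range(d)]
    gPsi = [[2.0 * C2 * v for v in row] for row in Psi]
    for x, g, yi in zip(X, G, y):
        f = [sum(l * xi for l, xi in zip(Lrow, x)) +
             sum(v * gi for v, gi in zip(Prow, g))
             for Lrow, Prow in zip(L, Psi)]
        h = max(0.0, 1.0 - yi * (sum(wi * fi for wi, fi in zip(w, f)) + b))
        J += C1 * h * h
        coef = -2.0 * C1 * h * yi
        for r in range(d):
            for c in range(len(x)):
                gL[r][c] += coef * w[r] * x[c]    # entry of y_i w x_i^T
            for c in range(len(g)):
                gPsi[r][c] += coef * w[r] * g[c]  # entry of y_i w G^T(x_i)
    return J, gL, gPsi

def numeric_partial(args, name, r, c, eps=1e-6):
    """Central finite difference of J w.r.t. entry (r, c) of args[name]."""
    def J_at(delta):
        shifted = dict(args)
        mat = [row[:] for row in args[name]]
        mat[r][c] += delta
        shifted[name] = mat
        return objective_and_grads(**shifted)[0]
    return (J_at(eps) - J_at(-eps)) / (2.0 * eps)

args = dict(
    X=[[0.5, 0.2], [-0.3, 0.4], [0.1, -0.6]],
    G=[[0.3, 0.9], [0.7, 0.2], [0.5, 0.5]],   # hypothetical kernel values
    y=[1, -1, 1], w=[0.8, -0.5], b=0.1,
    L=[[1.0, 0.0], [0.0, 1.0]],
    Psi=[[0.1, -0.1], [0.05, -0.05]],
    C1=1.0, C2=0.5,
)
J, gL, gPsi = objective_and_grads(**args)
print(gPsi[0][1], numeric_partial(args, "Psi", 0, 1))
print(gL[1][0], numeric_partial(args, "L", 1, 0))
```

The analytic and numeric values should agree closely as long as no sample sits exactly on the hinge kink, where the squared hinge is still differentiable but the central difference mixes both sides.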

To sum up, the optimal nonlinear transformation defined by $\{L, \Psi\}$, along with the optimal
SVM classifier coefficients $\{w, b\}$, can be obtained by an EM-like iterative procedure, as
described in Algorithm 1.


Algorithm 1 TPS Metric Learning for SVM (TML-SVM)

Input: training dataset $X = \{x_i \mid x_i \in \mathbb{R}^d, i = 1, \dots, n\}$,
       class label information $y_i \in \{-1, +1\}$

Initialize: $\Psi = 0$, $L = I$

Centralize the input data: $x_i \leftarrow x_i - \frac{1}{n} \sum_{j=1}^{n} x_j$

Iterate the following two steps:

    (1) Update $\{w, b\}$ with fixed $\{L, \Psi\}$:
        Compute the transformed data $f(x_i)$ by following Eqn. (8)
        Update $\{w, b\}$ by using an off-the-shelf SVM solver with $f(x_i)$ as input

    (2) Update $\{L, \Psi\}$ with fixed $\{w, b\}$:
        Update $\{L, \Psi\}$ by solving Eqn. (13) through a gradient-based optimizer (footnote 2)

until convergence

Output: the optimal SVM classifier defined by $\{w, b\}$,
        the nonlinear TPS transformation defined by $\{L, \Psi\}$


3.1 Kernelization of TML-SVM 

TML-SVM can be kernelized through a kernel principal component analysis (KPCA) based framework.
Unlike the traditional kernel trick [29], which often involves the derivation of new mathematical
formulas, the KPCA based framework provides an alternative that can directly utilize the original
linear models. Typically, it consists of two simple stages: first, map the input data into a
kernel feature space introduced by KPCA; then, train the linear model in this kernel space. Proved
to be equivalent to the traditional kernel trick, this KPCA based framework also provides a
convenient way to speed up a learner if a low-rank KPCA is used. Through this procedure,
kernelized TML-SVM can be easily realized by directly applying Algorithm 1 in the mapped KPCA
space. Further details of this KPCA-based approach are given in the references that introduce it.

Footnote 2: We still use “fmincon” in Matlab to solve Eqn. (13). In practice, convergence of the
second inner step is not necessary, so we use an early stopping strategy to speed up the whole
algorithm.
4 Experimental Results


In this section, we present evaluation and comparison results of applying our proposed TPS-based
nonlinear ML methods to seven widely used datasets from the UCI machine learning repository. The
details of these datasets are summarized in the leftmost column of Table 1. The three numbers
inside the brackets indicate the data size, feature dimension, and number of classes of the
corresponding dataset. All datasets have been preprocessed through normalization. To demonstrate
the effectiveness of our proposed nonlinear metric learning method, we first choose kNN as the
baseline classifier, and compare the improvements made by TML-NN against five state-of-the-art NN
based metric learning methods; then, similar experiments are conducted to show the improvements
made by our proposed TML-SVM over traditional SVMs.


4.1 Comparisons with NN based ML solutions 

The first set of experiments is within the Nearest Neighbor (NN) category. We choose $k = 1$ in
kNN, and the five competing metric learning methods are: Large Margin Nearest Neighbor
classification (LMNN) [7], Information-Theoretic Metric Learning (ITML) [6], Neighborhood
Components Analysis (NCA) [4], GB-LMNN [18] and Parametric Local Metric Learning (PLML) [16]. The
hyper-parameters of NCA, ITML, LMNN and GB-LMNN are set by following [4], [6], [7] and [18],
respectively. PLML has a number of hyper-parameters, so we follow the suggestion of [16]: use
3-fold CV to select $\alpha_2$ from $\{0.01 \sim 1000\}$, and set the other hyper-parameters to
their defaults. In our TML-NN model, there are two hyper-parameters: the number of anchor points
$p$ and the weighting factor $\lambda$. For $p$, we empirically set it to 30% of the training
samples; for $\lambda$, we select it through CV from $\{5^{-5} \sim 5^{25}\}$.


Table 1: Mean and standard deviation of kNN based classification accuracy results on seven UCI
datasets. Boldface denotes the highest classification accuracy for each dataset. The symbols
+, - and = in the TML-NN column indicate a significant win, loss or no difference, respectively,
based on the pairwise Student's t-test with the other six methods (in column order). The number
in parentheses denotes the score of the respective method for the given dataset.

| Datasets | kNN | LMNN | ITML | NCA | PLML | GB-LMNN | TML-NN |
|---|---|---|---|---|---|---|---|
| Iris [150/4/3] | 95.70 ± 2.31 (3.5) | 95.06 ± 2.62 (3.0) | 95.22 ± 2.56 (3.0) | 94.68 ± 2.35 (2.5) | 84.22 ± 4.54 (0) | 95.15 ± 2.17 (3.0) | **96.49 ± 2.32** ++++++ (6.0) |
| Wine [178/13/3] | 95.21 ± 2.04 (0.0) | 97.25 ± 1.80 (4.5) | 96.90 ± 2.31 (3.5) | 96.65 ± 2.27 (2.5) | 96.61 ± 2.10 (2.5) | 96.80 ± 1.94 (3.5) | **97.28 ± 2.07** +==++= (4.5) |
| Breast [683/10/2] | 95.35 ± 1.34 (1.0) | 95.66 ± 1.39 (2.0) | 95.76 ± 1.30 (2.5) | 95.57 ± 1.13 (1.5) | **96.18 ± 0.98** (5.0) | 96.04 ± 1.22 (5.0) | 95.97 ± 1.04 +==+== (4.0) |
| Diabetes [768/8/2] | 70.58 ± 2.26 (4.0) | 70.54 ± 2.52 (4.0) | 68.81 ± 2.65 (1.0) | 68.53 ± 2.71 (1.0) | 69.04 ± 2.30 (1.0) | 70.62 ± 2.23 (4.0) | **71.54 ± 2.21** ++++++ (6.0) |
| Liver [345/6/2] | 61.20 ± 3.96 (1.0) | 60.79 ± 3.54 (1.0) | 60.07 ± 4.92 (1.0) | 62.63 ± 4.15 (3.0) | 64.74 ± 3.99 (5.0) | 64.81 ± 3.80 (5.0) | **64.97 ± 4.28** ++++== (5.0) |
| Sonar [208/60/2] | 84.73 ± 3.45 (3.0) | 84.12 ± 4.13 (1.5) | 82.14 ± 5.94 (0) | 85.46 ± 3.51 (3.5) | **87.42 ± 4.70** (6.0) | 85.48 ± 4.04 (3.5) | 85.35 ± 3.82 =++=-= (3.5) |
| Ionosphere [351/34/2] | 85.83 ± 2.62 (0) | 88.40 ± 2.54 (3.0) | 87.45 ± 3.07 (1.0) | 88.33 ± 2.77 (3.0) | **91.03 ± 2.23** (6.0) | 89.47 ± 2.70 (4.5) | 88.79 ± 2.37 +=+=-= (3.5) |
| Total Score | 12.5 | 19.0 | 12.0 | 17.0 | 25.5 | 28.5 | 32.5 |


To better compare the classification performance, we run the experiment 100 times with different
random 3-fold splits of each dataset, two folds for training and one for testing. Furthermore, we
conduct a pairwise Student's t-test at a significance level of 0.05 among the seven methods for
each dataset. Then, a ranking scheme is used to evaluate the relative performance of these
algorithms: a method A is assigned 1 point if it has a statistically significantly better
accuracy than another method B, 0.5 points if there is no significant difference, and 0 points if
A performs significantly worse than B. The experimental results averaged over the 100 runs are
reported in Table 1.
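The ranking scheme can be written down in a few lines. The sketch below is illustrative: the function names are our own, and the p-values passed to `outcome` are hypothetical placeholders rather than values from our tests; the outcome strings fed to `score` are the TML-NN entries of Table 1.

```python
POINTS = {"+": 1.0, "=": 0.5, "-": 0.0}

def outcome(mean_a, mean_b, p_value, alpha=0.05):
    """Return '+', '-' or '=' for method A against method B."""
    if p_value >= alpha:
        return "="            # no statistically significant difference
    return "+" if mean_a > mean_b else "-"

def score(outcomes):
    """Total points of one method on one dataset from its pairwise outcomes."""
    return sum(POINTS[o] for o in outcomes)

print(outcome(96.49, 95.70, p_value=0.01))  # hypothetical p-value
print(score("++++++"))   # TML-NN on Iris
print(score("+==++="))   # TML-NN on Wine
print(score("=++=-="))   # TML-NN on Sonar
```

With seven methods, each method takes part in six pairwise comparisons per dataset, so its per-dataset score lies between 0 and 6.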

From Table 1 we can see that TML-NN outperforms the other six methods in a statistically
significant manner, with a total score of 32.5 points. Out of the total 42 pairwise comparisons,
TML-NN has 25 statistical wins. Furthermore, it significantly improves the baseline method, kNN,
on six of the seven datasets, and performs equally well on the seventh (“Sonar”). It is also worth
pointing out that the proposed nonlinear TML-NN has 14 wins and no losses out of the total 21
comparisons against the linear ML solutions (LMNN, ITML, NCA); against the local nonlinear ML
solutions (PLML, GB-LMNN), TML-NN has five wins and only two losses out of the total 14
comparisons.


4.2 Improvements over SVMs 


To verify the effectiveness of our proposed nonlinear metric learning for SVMs, we conduct
another set of experiments on the same seven UCI datasets to compare the following four SVM
models: linear SVM (l-SVM), kernel SVM (r-SVM), our proposed TML-SVM and kernel TML-SVM. For
l-SVM, we directly utilize the off-the-shelf LIBSVM solver [30], for which the slack coefficient
$C$ is tuned through 3-fold CV from $\{2^{-5} \sim 2^{15}\}$. For r-SVM, we choose the Gaussian
kernel and select the kernel width $\sigma$ through CV from $\{d_{min} \sim 20 d_{min}\}$, where
$d_{min}$ is the mean of the distances between each input point and its nearest neighbor. TML-SVM
has three hyper-parameters to be tuned: the number of anchor points $p$ and the trade-off
coefficients $C_1$ and $C_2$. For $p$, we again empirically set it to 30% of the training samples;
for $C_1$ and $C_2$, we select them through CV from $\{2^{-5} \sim 2^{15}\}$ and
$\{5^{-5} \sim 5^{25}\}$, respectively. In kernel TML-SVM, we use the same Gaussian kernel width
$\sigma$ selected for r-SVM on each dataset, and tune the other parameters $C_1$ and $C_2$ as in
TML-SVM. To deal with multi-class classification, we apply the “one-against-one” strategy on top
of binary TML-SVM and kernel TML-SVM, the same as used in LIBSVM.
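As an illustration of the kernel-width heuristic, the sketch below computes $d_{min}$ as the mean nearest-neighbor distance and builds exponent grids of the kind used for $C$ and $C_1$. The function names, the toy data and the unit step between exponents are assumptions; the text above only fixes the endpoints of each search range.

```python
import math

def mean_nn_distance(X):
    """d_min: mean over all points of the distance to their nearest neighbor."""
    total = 0.0
    for i, x in enumerate(X):
        total += min(math.dist(x, z) for j, z in enumerate(X) if j != i)
    return total / len(X)

def exponent_grid(base, lo, hi, step=1):
    """Candidate values base^lo, ..., base^hi (unit exponent step is an assumption)."""
    return [float(base) ** e for e in range(lo, hi + 1, step)]

X = [[0.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
d_min = mean_nn_distance(X)
sigma_range = (d_min, 20.0 * d_min)   # endpoints of the search range for sigma
C_grid = exponent_grid(2, -5, 15)     # candidates for C (and C_1)
print(d_min, sigma_range, len(C_grid))
```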


We adopt the same experimental setting and statistical ranking scheme as in the NN based
classification, and report the results averaged over 100 runs in Table 2. It is evident that
combining our proposed nonlinear metric learning with SVMs has significantly improved the
performance of both l-SVM and r-SVM. To be specific, TML-SVM outperforms l-SVM on all seven
datasets; kernel TML-SVM also fares better than r-SVM on all seven datasets. Furthermore, it is
worth pointing out that TML-SVM has significantly improved l-SVM's classification rates,
performing better than or comparably to r-SVM on five datasets (“Iris”, “Wine”, “Breast”,
“Diabetes”, and “Liver”).

Table 2: Mean and standard deviation of SVM based classification accuracy results on seven UCI
datasets. The settings and notations of the comparison scores are identical to those in Table 1.


| Datasets | l-SVM | TML-SVM | r-SVM | kernel TML-SVM |
|---|---|---|---|---|
| Iris | 95.94 ± 2.42 -=- (0.5) | 96.67 ± 2.31 +== (2.0) | 96.09 ± 2.34 ==- (1.0) | **96.81 ± 2.43** +=+ (2.5) |
| Wine | 97.20 ± 1.86 --- (0) | **98.97 ± 1.25** +++ (3.0) | 98.07 ± 1.80 +-- (1.0) | 98.46 ± 1.46 +-+ (2.0) |
| Breast | 96.73 ± 0.97 --- (0) | 97.15 ± 0.88 +=- (1.5) | 97.06 ± 0.83 +=- (1.5) | **97.44 ± 0.98** +++ (3.0) |
| Diabetes | 76.66 ± 2.18 -=- (0.5) | 77.24 ± 1.92 +== (2.0) | 77.07 ± 2.05 ==- (1.0) | **77.69 ± 2.20** +=+ (2.5) |
| Liver | 69.06 ± 3.79 --- (0) | 72.62 ± 3.13 +== (2.0) | 72.35 ± 3.76 +=- (1.5) | **73.40 ± 3.58** +=+ (2.5) |
| Sonar | 75.78 ± 4.16 --- (0) | 82.16 ± 3.79 +-- (1.0) | 84.96 ± 4.28 ++- (2.0) | **86.54 ± 3.47** +++ (3.0) |
| Ionosphere | 87.75 ± 2.42 --- (0) | 92.94 ± 2.06 +-- (1.0) | 94.36 ± 1.87 ++- (2.0) | **95.12 ± 1.72** +++ (3.0) |
| Total Score | 1.0 | 12.5 | 10.0 | 18.5 |


5 Conclusion 

In this paper, we present two nonlinear metric learning solutions, for kNN and SVMs respectively,
based on geometric transformations. The novelty of our approaches lies in the fact that they
generalize the linear or piecewise linear transformations in traditional metric learning
solutions to a globally smooth nonlinear deformation of the input space. The geometric model used
in this paper is thin-plate splines, and it can be extended to other radial distance functions.
Exploring other types of geometric models from the perspective of conditionally positive definite
kernels is the direction of our future efforts. We are also interested in investigating a more
efficient numerical optimization scheme (or an analytic form) for the proposed TPS based methods.